17 research outputs found
Planting a SEED of Vision in Large Language Model
We present SEED, an elaborate image tokenizer that empowers Large Language
Models (LLMs) with the emergent ability to SEE and Draw at the same time.
Research on image tokenizers has previously reached an impasse: frameworks
employing quantized visual tokens have lost prominence due to subpar
performance and convergence in multimodal comprehension (compared to BLIP-2,
etc.) or generation (compared to Stable Diffusion, etc.). Despite these
limitations, we remain confident in the natural capacity of discrete visual
tokens to unify visual and textual representations, facilitating scalable
multimodal training with the LLM's original recipe. In this study, we identify
two crucial principles for the
architecture and training of SEED that effectively ease subsequent alignment
with LLMs. (1) Image tokens should be independent of 2D physical patch
positions and instead be produced with a 1D causal dependency, exhibiting
intrinsic interdependence that aligns with the left-to-right autoregressive
prediction mechanism in LLMs. (2) Image tokens should capture high-level
semantics consistent with the degree of semantic abstraction in words, and be
optimized for both discriminativeness and reconstruction during the tokenizer
training phase. As a result, an off-the-shelf LLM is able to perform both
image-to-text and text-to-image generation by incorporating our SEED through
efficient LoRA tuning. Comprehensive multimodal pretraining and instruction
tuning, which may yield improved results, are reserved for future
investigation. This version of SEED was trained in 5.7 days using only 64 V100
GPUs and 5M publicly available image-text pairs. Our preliminary study
emphasizes the great potential of discrete visual tokens in versatile
multimodal LLMs and the importance of proper image tokenizers in broader
research. Comment: Technical Report; Project released at:
https://github.com/AILab-CVC/SEE
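Principle (1) above, 1D causal token dependency, can be illustrated with a toy sketch in which each discrete token is chosen conditioned only on the pooled image features and the tokens already emitted. All names, shapes, and the nearest-codebook lookup below are hypothetical simplifications for illustration, not SEED's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_tokenize(patch_feats, codebook, num_tokens=4):
    """Toy 1D causal tokenizer sketch (hypothetical, not SEED itself).

    Each token is quantized against the codebook using a query that
    depends on the image and on previously emitted tokens only, never
    on 2D patch positions -- mirroring left-to-right LLM prediction.
    """
    d = patch_feats.shape[-1]
    pooled = patch_feats.mean(axis=0)   # collapse the 2D patch grid
    tokens = []
    state = np.zeros(d)                 # summary of tokens emitted so far
    for _ in range(num_tokens):
        query = pooled + state          # causal: input + token history
        idx = int(np.argmin(np.linalg.norm(codebook - query, axis=1)))
        tokens.append(idx)
        state = state + codebook[idx]   # update causal history
    return tokens

patch_feats = rng.normal(size=(16, 8))  # e.g. a 4x4 patch grid, dim 8
codebook = rng.normal(size=(32, 8))     # 32-entry visual codebook
print(causal_tokenize(patch_feats, codebook))
```

The resulting token sequence can then be fed to an autoregressive LLM exactly like word tokens, which is the alignment property the abstract argues for.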
Contrastive Masked Autoencoders for Self-Supervised Video Hashing
Self-Supervised Video Hashing (SSVH) models learn to generate short binary
representations for videos without ground-truth supervision, facilitating
large-scale video retrieval efficiency and attracting increasing research
attention. The success of SSVH lies in the understanding of video content and
the ability to capture the semantic relation among unlabeled videos. Typically,
state-of-the-art SSVH methods consider these two points in a two-stage training
pipeline, where they firstly train an auxiliary network by instance-wise
mask-and-predict tasks and secondly train a hashing model to preserve the
pseudo-neighborhood structure transferred from the auxiliary network. This
consecutive training strategy is inflexible and, we argue, unnecessary. In this
paper, we propose a simple yet effective one-stage SSVH method called ConMH,
which incorporates video semantic information and video similarity relationship
understanding in a single stage. To capture video semantic information for
better hashing learning, we adopt an encoder-decoder structure to reconstruct
the video from its temporally masked frames. In particular, we find that a
higher masking ratio improves video understanding. In addition, we fully
exploit the
similarity relationship between videos by maximizing agreement between two
augmented views of a video, which contributes to more discriminative and robust
hash codes. Extensive experiments on three large-scale video datasets (i.e.,
FCVID, ActivityNet and YFCC) indicate that ConMH achieves state-of-the-art
results. Code is available at https://github.com/huangmozhi9527/ConMH. Comment: This work is accepted by AAAI 2023. 9 pages, 6 figures, 6 tables
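The view-agreement objective described above, maximizing agreement between two augmented views of the same video, can be sketched as an NT-Xent-style contrastive loss. This is an illustrative implementation of that general objective, not ConMH's exact loss:

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent-style sketch of a two-view agreement loss.

    z1, z2: (N, d) embeddings of two augmented views of the same N videos.
    For each embedding, its counterpart in the other view is the positive;
    all other 2N - 2 embeddings in the joint batch are negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)        # (2N, d) joint batch
    sim = z @ z.T / temperature                 # cosine similarities
    np.fill_diagonal(sim, -np.inf)              # exclude self-pairs
    n = len(z1)
    # positive of row i is row i+n, and vice versa
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()
```

Lower loss means the two views of each video agree more strongly than they match any other video, which is what pushes the hash codes toward being discriminative.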
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Pre-training on large-scale video data has become a common recipe for
learning transferable spatiotemporal representations in recent years. Despite
some progress, existing methods are mostly limited to highly curated datasets
(e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We
argue this is because they capture only pixel-level knowledge rather than
spatiotemporal commonsense, which falls far short of cognition-level video
understanding. Inspired by the great success of image-text pre-training
(e.g., CLIP), we take the first step to exploit language semantics to boost
transferable spatiotemporal representation learning. We introduce a new pretext
task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR
scripts by attending to learned video representations. We do not rely on
descriptive captions; instead, we learn purely from video, leveraging naturally
transcribed speech to provide noisy but useful semantics over time.
Furthermore, rather than the simple concept learning in vision-caption
contrast, we encourage cognition-level temporal commonsense reasoning via
narrative reorganization. The advantages enable our model to contextualize what
is happening like human beings and seamlessly apply to large-scale uncurated
video data in the real world. Note that our method differs from ones designed
for video-text alignment (e.g., Frozen) and multimodal representation learning
(e.g., Merlot). Our method demonstrates strong out-of-the-box spatiotemporal
representations on diverse video benchmarks, e.g., +13.6% gains over VideoMAE
on SSV2 via linear probing.
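The data construction behind the TVTS pretext task can be sketched as follows: shuffle the time-ordered ASR segments and keep the permutation that restores the original order as the sorting target. The example format and segment texts are hypothetical:

```python
import random

def make_sorting_example(transcript_segments, seed=0):
    """Toy TVTS-style training example (hypothetical format).

    Returns the shuffled segments and, as the sorting target, each
    shuffled segment's position in the original time-ordered transcript.
    """
    rng = random.Random(seed)
    order = list(range(len(transcript_segments)))
    rng.shuffle(order)
    shuffled = [transcript_segments[i] for i in order]
    # target[j] = original index of shuffled segment j
    return shuffled, order

segments = ["crack two eggs", "whisk them", "pour into the pan", "flip once"]
shuffled, target = make_sorting_example(segments)

# A model that sorts correctly recovers the original transcript:
restored = [None] * len(segments)
for j, seg in enumerate(shuffled):
    restored[target[j]] = seg
assert restored == segments
```

The model never sees the target order directly; it must predict it by attending to the video, which is what forces temporal commonsense into the representation.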
The final step effect
Suppose you need to complete a task of five steps, each with equal difficulty and pass rate. You have a privilege that guarantees you will pass one of the steps, but you must decide which step to privilege before you start the task. Which step do you choose? Mathematically, the effect of each step on the final outcome is identical, so there seems to be no prima facie reason for a preference. Five studies were conducted to explore this issue. In Study 1, participants could place the privilege on any of steps 1–5; they were most inclined to privilege step 5. In Study 2, participants had to pay money to purchase the privilege for each of steps 1–5; they would pay the most for step 5. Study 3 directly reminded participants that the probability of success for the whole task is mathematically the same no matter where the privilege is placed, yet most participants still preferred to privilege the final step. In Study 4, the outcomes of all steps were not announced until every step was finished, and participants were asked how painful they would feel if they passed all steps but one; they expected to feel the most pain when failing at the final step. In Study 5, an implicit association test showed that people associated the first step with "easy" and the final step with "hard". These results demonstrate the final step effect and suggest that both anticipated painfulness and stereotype may play a role in this phenomenon.
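The equivalence that Study 3 points out is easy to verify: with independent steps of pass rate p, guaranteeing any single step leaves the overall success probability at p raised to the number of remaining steps. A quick check with a hypothetical per-step pass rate:

```python
p = 0.6        # hypothetical per-step pass rate
n_steps = 5

def success_prob(privileged_step):
    """Overall success probability when one step is guaranteed to pass:
    the other n - 1 independent steps must each succeed on their own."""
    probs = [1.0 if s == privileged_step else p for s in range(n_steps)]
    total = 1.0
    for q in probs:
        total *= q
    return total

# Identical regardless of which step is privileged: p ** (n_steps - 1)
for s in range(n_steps):
    assert abs(success_prob(s) - p ** (n_steps - 1)) < 1e-12
```

So the preference for privileging step 5 cannot be explained by the arithmetic; the studies attribute it to anticipated painfulness and stereotype instead.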
MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation
The goal of sequential recommendation (SR) is to predict the items a user may
be interested in based on his or her historical interaction sequences. Most
existing sequential recommenders are developed based on ID features, which,
despite their widespread use, often underperform with sparse IDs and struggle
with the cold-start problem. Besides, inconsistent ID mappings hinder the
model's transferability, isolating similar recommendation domains that could
have been co-optimized. This paper aims to address these issues by exploring
the potential of multi-modal information in learning robust and generalizable
sequence representations. We propose MISSRec, a multi-modal pre-training and
transfer learning framework for SR. On the user side, we design a
Transformer-based encoder-decoder model, where the contextual encoder learns to
capture the sequence-level multi-modal synergy while a novel interest-aware
decoder is developed to grasp item-modality-interest relations for better
sequence representation. On the candidate item side, we adopt a dynamic fusion
module to produce user-adaptive item representation, providing more precise
matching between users and items. We pre-train the model with contrastive
learning objectives and fine-tune it in an efficient manner. Extensive
experiments demonstrate the effectiveness and flexibility of MISSRec, promising
a practical solution for real-world recommendation scenarios. Comment: Accepted to ACM MM 202
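The dynamic fusion idea on the candidate item side, weighting an item's modality embeddings by their affinity to the user state, can be sketched as below. The scoring and names are illustrative assumptions, not MISSRec's exact module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_fusion(user_vec, item_modal_embs):
    """User-adaptive item representation sketch (hypothetical).

    user_vec:        (d,) user sequence representation
    item_modal_embs: list of (d,) embeddings, one per modality
                     (e.g. text, image) of a candidate item
    """
    # affinity of the user state to each modality embedding
    scores = np.array([user_vec @ m for m in item_modal_embs])
    weights = softmax(scores)
    # convex combination: modalities the user attends to weigh more
    return sum(w * m for w, m in zip(weights, item_modal_embs))
```

Because the weights depend on the user vector, the same item gets a different fused representation for different users, which is what "user-adaptive" means here.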
An MEC Architecture-Oriented Improved RRT Algorithm for Regional Trajectory Planning
Multi-access Edge Computing (MEC), which provides real-time computing capability, is considered an effective approach to improving the performance of Vehicular Ad Hoc Networks (VANETs). MEC can process regional vehicle information and generate real-time road hazard features, which can be used in the trajectory planning process of vehicles. In this paper, an MEC-oriented VANET infrastructure is presented, and a road hazard feature-based trajectory planning method is proposed. A Back Propagation (BP) neural network is employed to predict changes in road hazard features, and a hazard-based cost function is defined. An improved Rapidly Exploring Random Tree (RRT) algorithm is then proposed for regional trajectory planning. A joint simulation is conducted on the SUMO and NS3 platforms, and the results verify the effectiveness and stability of the proposed algorithm.
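A single extension step of a hazard-aware RRT can be sketched as follows. The `hazard` function stands in for a cost built from MEC-aggregated road features; the rejection threshold, step size, and goal bias are illustrative values, not the paper's parameters:

```python
import math
import random

def rrt_extend(tree, goal, hazard, step=1.0, goal_bias=0.1, rng=random):
    """One extension step of a hazard-aware RRT sketch.

    tree:   list of (x, y) nodes, seeded with the start position
    goal:   (x, y) target; sampled directly with probability goal_bias
    hazard: callable (x, y) -> cost in [0, 1]; extensions into
            high-hazard space are rejected (illustrative threshold)
    """
    sample = goal if rng.random() < goal_bias else (
        rng.uniform(0, 20), rng.uniform(0, 20))
    nearest = min(tree, key=lambda p: math.dist(p, sample))
    d = math.dist(nearest, sample)
    if d == 0:
        return None
    # steer from the nearest node toward the sample by one step length
    new = (nearest[0] + step * (sample[0] - nearest[0]) / d,
           nearest[1] + step * (sample[1] - nearest[1]) / d)
    if hazard(new) > 0.8:   # hazard-based rejection
        return None
    tree.append(new)
    return new
```

Calling `rrt_extend` in a loop grows the tree toward the goal while keeping nodes out of high-hazard regions; a full planner would also extract the path from goal back to start along tree edges.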
Coexistence of Antibiotic Resistance Genes and Virulence Factors Deciphered by Large-Scale Complete Genome Analysis
Widespread use of antibiotics has enhanced the evolution of highly resilient pathogens and poses a severe risk to human health via the coselection of antibiotic resistance genes (ARGs) and virulence factors (VFs). In this study, we rigorously evaluate the abundance relationship and physical linkage between ARGs and VFs by performing a comprehensive analysis of 9,070 bacterial genomes isolated from multiple species and hosts. The coexistence of ARGs and VFs was observed in bacteria across distinct phyla, pathogenicities, and habitats, especially among human-associated pathogens. The coexistence patterns of gene elements in different habitats and pathogenicity groups were similar, presumably due to frequent gene transfer. A shorter intergenic distance between mobile genetic elements and ARGs/VFs was detected in human/animal-associated bacteria, indicating a higher transfer potential. Increased accumulation of exogenous ARGs/VFs in human pathogens highlights the importance of gene acquisition in the evolution of human commensal bacteria. Overall, the findings provide insights into the genetic features of ARG-VF combinations and expand our understanding of ARG-VF coexistence in bacteria.

IMPORTANCE Antibiotic resistance has become a serious global health concern. Despite numerous case studies, a comprehensive analysis of ARG and VF coexistence in bacteria has been lacking. In this study, we explore the coexistence profiles of ARGs and VFs in diverse categories of bacteria using a high-resolution bioinformatics approach. We also provide compelling evidence of unique ARG-VF gene pairs coexisting in specific bacterial genomes and reveal the potential risk associated with the coexistence of ARGs and VFs in both clinical settings and the environment.